Abstract. We study the computational complexity of basic decision problems for one-counter simple
stochastic games (OC-SSGs), under various objectives. OC-SSGs are 2-player turn-based stochastic 
games played on the transition graph of classic one-counter automata. We study primarily the termination 
objective, where the goal of one player is to maximize the probability of reaching counter value 0, 
while the other player wishes to avoid this. Partly motivated by the goal of understanding termination 
objectives, we also study certain "limit" and "long run average" reward objectives that are closely related 
to some well-studied objectives for stochastic games with rewards. Examples of problems we address 
include: does player 1 have a strategy to ensure that the counter eventually hits 0, i.e., terminates, almost 
surely, regardless of what player 2 does? Or that the lim inf (or lim sup) counter value equals $\infty$ with a
desired probability? Or that the long run average reward is $> 0$ with desired probability? We show that
the qualitative termination problem for OC-SSGs is in NP $\cap$ coNP, and is in P-time for 1-player
OC-SSGs, or equivalently for one-counter Markov Decision Processes (OC-MDPs). Moreover, we show that
quantitative limit problems for OC-SSGs are in NP $\cap$ coNP, and are in P-time for 1-player OC-MDPs.
Both qualitative limit problems and qualitative termination problems for OC-SSGs are already at least 
as hard as Condon's quantitative decision problem for finite-state SSGs. 

1 Introduction 

There is a rich literature on the computational complexity of analyzing finite-state Markov decision pro- 
cesses and stochastic games. In recent years, there has also been some research done on the complexity of 
basic analysis problems for classes of finitely-presented but infinite-state stochastic models and games whose
transition graphs arise from decidable infinite-state automata-theoretic models, including: context-free
processes, one-counter processes, and pushdown processes (see, e.g., [8]). It turns out that such stochastic
automata-theoretic models are intimately related to classic stochastic processes studied extensively in ap- 
plied probability theory, such as (multi-type-)branching processes and (quasi-)birth-death processes (QBDs) 
(see 1517121 ). 

In this paper we continue this line of work by studying one-counter simple stochastic games (OC- 
SSGs), which are turn-based 2-player zero-sum stochastic games on transition graphs of classic one-counter 
automata. In more detail, an OC-SSG has a finite set of control states, which are partitioned into three types: 
a set of random states, from where the next transition is chosen according to a given probability distribution, 
and states belonging to one of two players: Max or Min, from where the respective player chooses the next 
transition. Transitions can change the state and can also change the value of the (unbounded) counter by at 
most 1. If there are no control states belonging to Max (Min, respectively), then we call the resulting 1-player
OC-SSG a minimizing (maximizing, respectively) one-counter Markov decision process (OC-MDP).

Fixing strategies for the two players yields a countable state Markov chain and thus a probability space
of infinite runs (trajectories). We focus in this paper on objectives that can be described by a (measurable) 
set of runs, such that player Max wants to maximize, and player Min wants to minimize, the probability of 
the objective. The central objective studied in this paper is termination: starting at a given control state and 
a given counter value $j > 0$, player Max (Min) wishes to maximize (minimize) the probability of eventually
hitting the counter value 0 (in any control state).

Different objectives give rise to different computational problems for OC-SSGs, aimed at computing 
the value of the game, or optimal strategies, with respect to that objective. From general known facts about 
stochastic games (e.g., Martin's Blackwell determinacy theorem [13]), it follows that the games we study
are determined, meaning they have a value: we can associate with each such game a value, $v$, such that for
every $\varepsilon > 0$, player Max has a strategy that ensures the objective is satisfied with probability at least $v - \varepsilon$
regardless of what player Min does, and likewise player Min has a strategy to ensure that the objective is
satisfied with probability at most $v + \varepsilon$. In the case of termination objectives, the value may be irrational even
when the input data contains only rational probabilities, and this is so even in the purely stochastic setting 
without any players, i.e., with only random control states (see Q). 

We can classify analysis problems for OC-SSGs into two kinds: quantitative analyses: "can the objective 
be achieved with probability at least/at most p" for a given $p \in [0, 1]$; or qualitative analyses, which ask
the same question but restricted to $p \in \{0, 1\}$. We are often also interested in what kinds of strategies (e.g.,
memoryless, etc.) achieve these. 

In a recent paper, [2], we studied one-player OC-SSGs, i.e., OC-MDPs, and obtained some complexity
results for them under qualitative termination objectives and some quantitative limit objectives. The prob- 
lems we studied included the qualitative termination problem (is the maximum probability of termination 
$= 1$?) for maximizing OC-MDPs. We showed that this problem is decidable in P-time. However, we left open
the complexity of the same problem for minimizing OC-MDPs (is the minimum probability of termination
$< 1$?). One of the main results of this paper is the following, which in particular resolves this open question:

Theorem 1. (Qualitative termination) Given an OC-SSG, Q, with the objective of termination, and given an
initial control state $s$ and initial counter value $j > 0$, deciding whether the value of the game is equal to 1 is
in NP $\cap$ coNP. Furthermore, the same problem is in P-time for 1-player OC-SSGs, i.e., for both maximizing
and minimizing OC-MDPs.

Improving on this NP $\cap$ coNP upper bound for the qualitative termination problem for OC-SSGs would
require a breakthrough: we show that deciding whether the value of an OC-SSG termination game is equal to
1 is already at least as hard as Condon's [5] quantitative reachability problem for finite-state simple
stochastic games (Corollary 1). We do not know a reduction in the other direction. We furthermore show that if
the value is 1 for an OC-SSG termination game, then Max has a simple kind of optimal strategy
(memoryless, counter-oblivious, and pure) that ensures termination with probability 1, regardless of Min's strategy.
Similarly, if the value is less than 1, we show Min has a simple strategy (using finite memory, linearly
bounded in the number of control states) that ensures the probability of termination is $< 1 - \delta$ for some
$\delta > 0$, regardless of what Max does. We show that such strategies for both players are computable
in non-deterministic polynomial time for OC-SSGs, and in deterministic P-time for (both maximizing and
minimizing) 1-player OC-MDPs. We also observe that the analogous problem of deciding whether the value
of an OC-SSG termination game is 0 is in P, which follows easily by reduction to non-probabilistic games.

OC-SSGs can be viewed as stochastic game extensions of Quasi-Birth-Death Processes (QBDs) (see II7I2II ). 
QBDs are a heavily studied model in queuing theory and performance evaluation (the counter keeps track of 
the number of jobs in a queue). It is very natural to consider controlled and game extensions of such
queuing theoretic models, thus allowing for adversarial modeling of queues with unknown (non-deterministic)
environments or with other unknown aspects modeled non-deterministically. OC-SSGs with termination
objectives also subsume "solvency games", a recently studied class of MDPs motivated by modeling of a
risk-averse investment scenario [1].

Due to the presence of an unbounded counter, an OC-SSG, Q, formally describes a stochastic game with 
a countably-infinite state space: a "configuration" or "state" of the underlying stochastic game consists of a 
pair (s, j), where s is a control state of Q and j is the current counter value. However, it is easy to see that 
we can equivalently view Q as a finite-state simple stochastic game (SSG), H, with rewards as follows: H
is played on the finite-state transition graph obtained from that of Q by simply ignoring the counter values.
Instead, every transition $t$ of H is assigned a reward, $r(t) \in \{-1, 0, 1\}$, corresponding to the effect that the
transition $t$ would have on the counter in Q. Furthermore, when emulating an OC-SSG using rewards, we
can easily place rewards on states rather than on transitions, by adding suitable auxiliary control states.
Thus, w.l.o.g., we can assume that OC-SSGs are presented as equivalent finite-state SSGs with a reward,
$r(s) \in \{-1, 0, 1\}$, labeling each state $s$. A run of H, $w$, is an infinite sequence of states that is permitted by the
transition structure, and we denote the $i$-th state along the run $w$ by $w(i)$. The termination objective for Q,
when the initial counter value is $j > 0$, can now be rephrased as the following equivalent objective for H:

\[ \mathit{Term}(j) := \{\, w \mid w \text{ is a run of } H \text{ such that there exists } m \ge 0 \text{ such that } \sum_{i=0}^{m} r(w(i)) = -j \,\} . \]
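
As a concrete reading of this definition, the following minimal Python sketch checks a finite prefix of a run for termination. It is only an illustration: the function name `termination_step` and the encoding of a prefix as a list of state rewards are assumptions made here, not notation from the paper.

```python
from itertools import accumulate
from typing import Optional, Sequence


def termination_step(rewards: Sequence[int], j: int) -> Optional[int]:
    """Return the least m with r(w(0)) + ... + r(w(m)) == -j, or None.

    `rewards` lists r(w(0)), r(w(1)), ... for a finite prefix of a run w;
    every entry must lie in {-1, 0, 1}, as assumed for state rewards.
    """
    if any(r not in (-1, 0, 1) for r in rewards):
        raise ValueError("rewards must lie in {-1, 0, 1}")
    # accumulate yields the partial sums r(w(0)), r(w(0)) + r(w(1)), ...
    for m, partial_sum in enumerate(accumulate(rewards)):
        if partial_sum == -j:
            return m
    return None  # the prefix does not witness membership in Term(j)
```

For instance, with initial counter value j = 2, the prefix with rewards 0, -1, 1, -1, -1 has partial sums 0, -1, 0, -1, -2, so the call `termination_step([0, -1, 1, -1, -1], 2)` returns 4. A result of `None` says nothing about the infinite run: a longer prefix may still terminate.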

An important step toward our proof of Theorem 1 and related results, is to establish links between this
termination objective and the following limit objectives, which are of independent interest. For $z \in \{-\infty, \infty\}$,
and a comparison operator $\Delta \in \{>, <, =\}$, consider the following objective:

\[ \mathit{LimInf}(\Delta\, z) := \{\, w \mid w \text{ is a run of } H \text{ such that } \liminf_{n \to \infty} \sum_{i=0}^{n} r(w(i)) \;\Delta\; z \,\} . \]

We will show that if $j$ is large enough (larger than the number of control states), then the game value with
respect to objective $\mathit{Term}(j)$ and the game value with respect to $\mathit{LimInf}(= -\infty)$ are either both equal to 1,
or are both less than 1 (Lemma 0J. We could also consider the "sup" variant of these objectives, such as
$\mathit{LimSup}(= -\infty)$, but these are redundant. For example, by negating the sign of rewards, $\mathit{LimSup}(= -\infty)$ is
"equivalent" to $\mathit{LimInf}(= +\infty)$. Indeed, the only limit objectives we need to consider for SSGs are $\mathit{LimInf}(= -\infty)$
and $\mathit{LimInf}(= +\infty)$, because the others are either the same objectives considered from the other player's
point of view, or are vacuous, such as $\mathit{LimInf}(> +\infty)$. For both limit objectives, $\mathit{LimInf}(= -\infty)$ and $\mathit{LimInf}(= +\infty)$,
we shall see that the value of the respective SSGs is always rational (Proposition 2). We shall also show that
the objective $\mathit{LimInf}(= +\infty)$ is essentially equivalent to the following "mean payoff" objective (Lemma 2):

\[ \mathit{Mean}(> 0) := \{\, w \mid w \text{ is a run of } H \text{ such that } \liminf_{n \to \infty} \sum_{i=0}^{n-1} r(w(i))/n > 0 \,\} . \]

This "intuitively obvious equivalence" is not so easy to prove. (Note also that $\mathit{LimInf}(= -\infty)$ is certainly not
equivalent to $\mathit{Mean}(< 0)$.) We establish the equivalence by a combination of new methods and by using recent
results by Gimbert, Horn and Zielonka [10,11,12]. Mean payoff objectives are of course very heavily studied
for stochastic games and for MDPs (see [15]). However, there is a subtle but important difference here:
mean payoff objectives are typically formulated via expected payoffs: the Max player wishes to maximize 
the expected mean payoff, while the Min player wishes to minimize this. Instead, in the above Mean(> 0) 
objective we wish to maximize (minimize) the probability that the mean payoff is > 0. These require new 
algorithms. Our main result about such limit objectives is the following: 

Theorem 2. For both limit objectives, $O \in \{\mathit{LimInf}(= -\infty), \mathit{LimInf}(= +\infty)\}$, given a finite-state SSG, Q,
with rewards, and given a rational probability threshold, $p$, $0 \le p \le 1$, deciding whether the value of Q with
objective $O$ is $\ge p$ (or $> p$) is in NP $\cap$ coNP. If Q is a 1-player SSG (i.e., a maximizing or minimizing MDP),
then the game value can be computed in P-time.
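
To illustrate the difference, noted before the theorem, between maximizing the expected mean payoff and maximizing the probability that the mean payoff is positive, here is a small Python sketch. The two actions and their numbers are hypothetical and chosen only for illustration: each action leads, with the stated probabilities, into an absorbing loop whose state reward (and hence mean payoff) is -1, 0 or 1.

```python
from fractions import Fraction

# Hypothetical one-shot choices for Max. Each action maps the mean payoff of
# the absorbing loop that is entered to the probability of entering it.
actions = {
    "gamble": {Fraction(1): Fraction(1, 10), Fraction(-1): Fraction(9, 10)},
    "safe": {Fraction(0): Fraction(1)},
}


def expected_mean_payoff(dist):
    """Expectation of the mean payoff under the given distribution."""
    return sum(payoff * prob for payoff, prob in dist.items())


def prob_mean_payoff_positive(dist):
    """Probability that the mean payoff is strictly positive."""
    return sum(prob for payoff, prob in dist.items() if payoff > 0)


best_for_expectation = max(actions, key=lambda a: expected_mean_payoff(actions[a]))
best_for_probability = max(actions, key=lambda a: prob_mean_payoff_positive(actions[a]))
```

In this toy instance the expectation criterion prefers "safe" (expected mean payoff 0 against -4/5), while the probability criterion prefers "gamble" (probability 1/10 against 0), so the two objectives can call for different strategies.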

Although our upper bounds for both these objectives look the same, their proofs are quite different. We
show that both players have pure and memoryless optimal strategies in these games (Proposition 1), which
can be computed in P-time for 1-player (Max or Min) MDPs. Furthermore, we show that even deciding 
whether the value of these games is either 1 or 0, given input for which one of these two is promised to be 
the case, is already at least as hard as Condon's [5] quantitative reachability problem for finite-state simple
stochastic games (Proposition @}. Thus, even any non-trivial approximation of the value of SSGs with such 
limit objectives is not easier than Condon's problem. 

We already considered in [2] the problem of maximizing the probability of $\mathit{LimInf}(= -\infty)$ in an OC-MDP.
There we showed that the maximum probability can be computed in P-time. However, again, we did not
resolve the complementary problem of minimizing the probability of $\mathit{LimInf}(= -\infty)$ in an OC-MDP. Thus we
could not address two-player OC-SSGs with either of these objectives, and we left these as key open
problems, which we resolve here. An important distinction between maximizing and minimizing the probability
of objective $\mathit{LimInf}(= -\infty)$ is that maximizing this objective satisfies a submixing property defined by
Gimbert [10], which he showed implies the existence of optimal memoryless strategies, whereas minimizing the
objective is not submixing, and thus we require new methods to tackle it, which we develop in this paper.

Finally, we mention that one can also consider OC-SSGs with the objective of terminating in a selected 
subset of states, F. Such objectives were considered for OC-MDPs in [2]. Using our termination results in
this paper, we can also show that given an OC-SSG it is decidable (in double exponential time) whether 
Max can achieve a termination probability 1 in a selected subset of states, F. The computational complexity 
of selective termination is higher than for non-selective termination: PSPACE-hardness holds already for 
OC-MDPs without Min ([2]). Due to space limitations, we omit results about selective termination from this
conference paper, and will include them in the journal version of this paper. 

Related work. As mentioned earlier, we initiated the study of some classes of 1-player OC-SSGs (i.e.,
OC-MDPs) in a recent paper [2]. The reader will find extensive references to earlier related literature in [2]. No
earlier work considered OC-SSGs explicitly, but as we have highlighted already there are close connections 
between OC-SSGs and finite-state stochastic games with certain interesting limiting average reward objec- 
tives. One-counter automata with a non-negative counter are equivalent to pushdown automata restricted to 
a 1-letter stack alphabet (see Q), and thus OC-SSGs with the termination objective form a subclass of push- 
down stochastic games, or equivalently, Recursive simple stochastic games (RSSGs). These more general 
stochastic games were introduced and studied in [8], where it is shown that many interesting computational 
problems for the general RSSG and RMDP models are undecidable, including generalizations of
qualitative termination problems for RMDPs. It was also established in [8] that for stochastic context-free games
(1-exit RSSGs), which correspond to pushdown stochastic games with only one state, both qualitative and
quantitative termination problems are decidable, and in fact qualitative termination problems are decidable
in NP $\cap$ coNP ([9]). Solving termination objectives is a key ingredient for many more general analyses and
model checking problems for stochastic games. OC-SSGs form another natural subclass of RSSGs, which is 
incomparable with stochastic context-free games. Specifically, for OC-SSGs with the termination objective,
the number of stack symbols, rather than the number of control states, of a pushdown stochastic game is 
being restricted to 1. As we show in this paper, this restriction again yields decidability of the qualitative
termination problem. However, the decidability of the quantitative termination problem for OC-SSGs remains
an open problem (see below). 

Open problems. Our results complete part of the picture for decidability and complexity of several problems 
for OC-SSGs. However, our results also leave many open questions. The most important open question for 
OC-SSGs is whether the quantitative termination problem, even for OC-MDPs, is decidable. Specifically, 
we do not know whether the following is decidable: given an OC-MDP, and a rational probability $p \in (0, 1)$,
decide whether the maximum probability of termination is $\ge p$ (or $> p$). Substantial new obstacles arise for
deciding this. In particular, we know that an optimal strategy may in general need to use different actions at
the same control state for arbitrarily large counter values (so strategies cannot ignore the value of the counter,
even for arbitrarily large values), and this holds already for the extremely simple case of solvency games
[1, Theorem 3.7].

Outline of paper. We fix notation and key definitions in Section 2. In Section 3 we prove Theorem 2.
Building on Section 3, we prove Theorem 1 in Section 4. Many proofs are in the appendix.

2 Preliminaries 

Definition 1. A simple stochastic game (SSG) is given by a finite, or countably infinite directed graph,
$(V, \to)$, where $V$ is the set of vertices (which we also call states), and $\to$ is the edge (also called transition)
relation, together with a partition $(V_\top, V_\bot, V_P)$ of $V$, as well as a probability assignment, $\mathit{Prob}$, which to
each $v \in V_P$ assigns a rational probability distribution on its set of outgoing edges. States in $V_P$ are called
random, states in $V_\top$ belong to player Max, and states in $V_\bot$ belong to player Min. We assume that for every
$v \in V$ there is at least one $u \in V$ such that $v \to u$. We often write $v \xrightarrow{x} u$ instead of $\mathit{Prob}(v \to u) = x$. If
$V_\bot = \emptyset$ we call Q a maximizing Markov decision process (MDP). If $V_\top = \emptyset$ we call it a minimizing MDP.
If $V_\bot = V_\top = \emptyset$ then we call Q a Markov chain. An SSG (or an MDP or a Markov chain) can be equipped
with a reward function, $r$, which assigns to each state, $v \in V$, a number $r(v) \in \{-1, 0, 1\}$.³ Similarly, rewards
can be assigned to transitions.

For a path, $w = w(0)w(1) \cdots w(n-1)$, of states in a graph, we use $\mathit{len}(w) = n$ to denote the length of $w$.
A run in an SSG, Q, is an infinite path in the underlying directed graph. The set of all runs in Q is denoted by
$\mathit{Run}_Q$, and the set of all runs starting with a finite path $w$ is $\mathit{Run}_Q(w)$. These sets generate the standard Borel
algebra on $\mathit{Run}_Q$.

A strategy for player Max is a function, $\sigma$, which to each history $w \in V^+$ ending in some $v \in V_\top$, assigns
a probability distribution on the set of outgoing transitions of $v$. We say that a strategy $\sigma$ is memoryless if
$\sigma(w)$ depends only on the last state, $v$, and pure if $\sigma(w)$ assigns probability 1 to some transition, for each
history $w$. When $\sigma$ is pure, we write $\sigma(w) = v'$ instead of $\sigma(w)(v, v') = 1$. Strategies for player Min are
defined similarly, just by substituting $V_\top$ with $V_\bot$.

Assume we fix a starting state $s$, and a pair of strategies: $\sigma$ for player Max, and $\pi$ for Min in an SSG, Q.
There is a unique probability measure, $\mathbb{P}_s^{\sigma,\pi}$, on the Borel space of runs $\mathit{Run}_Q$, satisfying for all finite paths
$w$ starting in $s$: $\mathbb{P}_s^{\sigma,\pi}(\mathit{Run}_Q(w)) = \prod_{i=1}^{\mathit{len}(w)-1} x_i$, where $x_i$, $1 \le i < \mathit{len}(w)$, are defined by requiring that (a) if
$w(i-1) \in V_P$ then $w(i-1) \xrightarrow{x_i} w(i)$; and (b) if $w(i-1) \in V_\top$ then $\sigma(w(0) \cdots w(i-1))$ assigns $x_i$ to the transition
$w(i-1) \to w(i)$; and (c) if $w(i-1) \in V_\bot$ then $\pi(w(0) \cdots w(i-1))$ assigns $x_i$ to the transition $w(i-1) \to w(i)$.
In particular, $\mathbb{P}_s^{\sigma,\pi}(\mathit{Run}_Q(s)) = 1$. In cases where Q is a maximizing MDP, a minimizing MDP, or a Markov
chain, we denote this probability measure by $\mathbb{P}_s^{\sigma}$, $\mathbb{P}_s^{\pi}$, or $\mathbb{P}_s$, respectively. See, e.g., lfT31 p. 30], for the existence
and uniqueness of the measure in the case of MDPs. It is straightforward then to establish existence and
uniqueness of $\mathbb{P}_s^{\sigma,\pi}$ for SSGs, by considering pairs of strategies to be one strategy.
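
The product formula for the probability of the set of runs extending a finite path can be sketched in Python as follows. This is a simplified illustration under stated assumptions: both strategies are taken to be memoryless (the definition above allows history-dependent ones), and the function name, the dictionary encoding and the three-state example are all hypothetical.

```python
from fractions import Fraction


def cylinder_probability(path, owner, prob, sigma, pi):
    """Probability of the set of runs that start with the finite path `path`.

    owner[v] is "P" (random), "max" or "min"; prob[v] maps the successors of
    a random state v to probabilities; sigma and pi are memoryless strategies
    mapping a state to a distribution over its successors.
    """
    result = Fraction(1)
    for u, v in zip(path, path[1:]):
        # x_i is read off from Prob, sigma or pi, depending on who owns u.
        dist = {"P": prob, "max": sigma, "min": pi}[owner[u]][u]
        result *= dist.get(v, Fraction(0))
    return result


# Hypothetical game: "a" belongs to Max, "b" is random, "c" belongs to Min.
owner = {"a": "max", "b": "P", "c": "min"}
prob = {"b": {"a": Fraction(1, 3), "c": Fraction(2, 3)}}
sigma = {"a": {"b": Fraction(1)}}                       # pure: always a -> b
pi = {"c": {"c": Fraction(1, 2), "a": Fraction(1, 2)}}  # randomized
p = cylinder_probability(["a", "b", "c", "c"], owner, prob, sigma, pi)
```

Here `p` is 1 · 2/3 · 1/2 = 1/3, and a path consisting of the start state alone has probability 1, matching the remark that the set of all runs from the start state has measure 1.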

In this paper, an objective for a stochastic game is given by a measurable set of runs. An objective, $O$, is
called a tail objective if for all runs $w$ and all suffixes $w'$ of $w$, we have $w' \in O \iff w \in O$.

Assume we have fixed an SSG, an objective, $O$, and a starting state, $s$. We define the value of Q in $s$ as
$\mathit{Val}^O(s) := \sup_\sigma \inf_\pi \mathbb{P}_s^{\sigma,\pi}(O)$. It follows from Martin's Blackwell determinacy theorem [13] that these games
are determined, meaning $\mathit{Val}^O(s) = \inf_\pi \sup_\sigma \mathbb{P}_s^{\sigma,\pi}(O)$. A strategy $\sigma$ for Max is optimal in $s$ if
$\mathbb{P}_s^{\sigma,\pi}(O) \ge \mathit{Val}^O(s)$ for every $\pi$. Similarly a strategy $\pi$ for Min is optimal in $s$ if $\mathbb{P}_s^{\sigma,\pi}(O) \le \mathit{Val}^O(s)$ for every $\sigma$. A strategy
is called optimal if it is optimal in every state.

3 Rewards can generally be arbitrary rational values, but for this paper we confine ourselves to rewards in $\{-1, 0, 1\}$.

An important objective for us is reachability. Given a set $T \subseteq V$, we define the objective
$\mathit{Reach}(T) := \{ w \in \mathit{Run}_Q \mid \exists i \ge 0 : w(i) \in T \}$. The following fact is well known:

Fact 3 (See, e.g., [15,5,6].) For both maximizing and minimizing finite-state MDPs with reachability
objectives, pure memoryless optimal strategies exist and can be computed, together with the optimal value, in
polynomial time.
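
For intuition about the reachability values in Fact 3, the sketch below uses plain value iteration in Python. This is a different and simpler technique than the polynomial-time algorithms the fact refers to: it only approximates the values from below and carries no polynomial-time guarantee. The function name, the encoding of the MDP and the four-state example are assumptions made for this illustration.

```python
def reach_value_iteration(succ, owner, prob, target, maximize=True, sweeps=200):
    """Approximate reachability values from below by value iteration.

    succ[v]: list of successors of v; owner[v]: "P" for a random state and
    "player" for a state of the unique player; prob[v][u]: probability of the
    transition v -> u for random v; target: set of goal states.
    """
    val = {v: (1.0 if v in target else 0.0) for v in succ}
    pick = max if maximize else min
    for _ in range(sweeps):
        new = {}
        for v in succ:
            if v in target:
                new[v] = 1.0
            elif owner[v] == "P":
                new[v] = sum(prob[v][u] * val[u] for u in succ[v])
            else:
                new[v] = pick(val[u] for u in succ[v])
        val = new
    return val


# Hypothetical MDP: from "s" the player moves to "r" or to the dead end "d";
# from "r" the target "t" is hit with probability 1/2, otherwise back to "s".
succ = {"s": ["r", "d"], "r": ["t", "s"], "t": ["t"], "d": ["d"]}
owner = {"s": "player", "r": "P", "t": "P", "d": "P"}
prob = {"r": {"t": 0.5, "s": 0.5}, "t": {"t": 1.0}, "d": {"d": 1.0}}
best = reach_value_iteration(succ, owner, prob, {"t"}, maximize=True)
worst = reach_value_iteration(succ, owner, prob, {"t"}, maximize=False)
```

In this example a maximizing player always moves to "r" and reaches the target almost surely, while a minimizing player moves to "d" and avoids it. The choice achieving the optimum in each state is pure and memoryless here, in line with the fact.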

3 Limit objectives 

All MDPs and SSGs in this section have finitely many states. Rewards are assigned to states, not to
transitions. The main goal of this section is to prove Theorem 2. We start by proving that both players have optimal
pure and memoryless strategies for objectives $\mathit{LimInf}(= -\infty)$, $\mathit{LimInf}(= +\infty)$, and $\mathit{Mean}(> 0)$. The following
is a corollary of a result by Gimbert and Zielonka, which allows us to concentrate on MDPs instead of SSGs:

Fact 4 (See [12, Theorem 2].) Fix any objective, $O$, and suppose that in every maximizing and minimizing
MDP with objective O, the unique player has a pure memoryless optimal strategy. Then in all SSGs with 
objective O, both players have optimal pure and memoryless strategies. 

Note that the probability of $\mathit{LimInf}(= -\infty)$ is minimized iff the probability of $\mathit{LimInf}(> -\infty)$ is
maximized, similarly with $\mathit{LimInf}(= +\infty)$ vs. $\mathit{LimInf}(< +\infty)$, and $\mathit{Mean}(> 0)$ vs. $\mathit{Mean}(\le 0)$.

Fact 5 (See [11, Theorem 4.5].) Let $O$ be a tail objective. Assume that for every maximizing MDP and for
every state, $s$, with $\mathit{Val}^O(s) = 1$, there is an optimal pure memoryless strategy starting in $s$. Then for all $s$
there is an optimal pure memoryless strategy starting in $s$, without restricting $\mathit{Val}^O(s)$.

Proposition 1. For every SSG, considered with any of the objectives $\mathit{LimInf}(= -\infty)$, $\mathit{LimInf}(= +\infty)$, or
$\mathit{Mean}(> 0)$, both players Max and Min have optimal pure memoryless strategies.

Proof. (Sketch.) Using Fact 4, we consider only maximizing MDPs, and prove the proposition for the
objectives listed and their complements. Note that since all these objectives are tail, a play under an optimal
strategy, starting from a state with value 1, cannot visit a state with value $< 1$. By Fact 5, we may thus safely
assume that the value is 1 in all states. We discuss different groups of objectives:

$\mathit{LimInf}(= -\infty)$, $\mathit{LimInf}(< +\infty)$, $\mathit{Mean}(\le 0)$, $\mathit{Mean}(> 0)$: The first three (with $\mathit{LimInf}(= -\infty)$ also handled
explicitly in [2]) are tail objectives and are also submixing (see [10]). Therefore, Theorem 1 of [10]
immediately yields the desired result. $\mathit{Mean}(> 0)$ can be equivalently rephrased via a submixing lim sup variant.
See Section A.1 in the appendix for details.

$\mathit{LimInf}(= +\infty)$: is a tail objective, so there is always a pure optimal strategy, $\tau$, by [11, Theorem 3.1]. Note
that $\mathit{LimInf}(= +\infty)$ is not submixing, so Theorem 1 of [10] does not apply. In the following we proceed
in two steps: we start with $\tau$ and convert it to a finite-memory strategy,⁴ $\sigma$. Finally, we reduce the use of
memory to get a memoryless strategy.

First, we obtain a finite-memory optimal strategy, starting in some state, $s$. For a run $w \in \mathit{Run}_Q(s)$ and
$i \ge 0$, we denote by $r[i](w)$ the accumulated reward $\sum_{j=0}^{i} r(w(j))$ up to step $i$. Observe that because $\tau$ is
optimal there is some $m > 0$ and a (measurable) set of runs $A \subseteq \mathit{Run}_Q(s)$, such that $\mathbb{P}_s^{\tau}(A) \ge \frac{1}{2}$, and for all

4 A finite-memory strategy is specified by a finite state automaton over the alphabet $V$. Given $w \in V^+$, the value
$\sigma(w)$ is determined by the state of this automaton after reading $w$.

$w \in A$ we have that the accumulated reward along $w$ never reaches $-m$ (i.e. $\inf_{i \ge 0} r[i](w) > -m$). Since for
almost all runs of $A$ we have $\lim_{i \to \infty} r[i](w) = \infty$, there is some $n > 0$ and a set $B \subseteq A$ such that $\mathbb{P}_s^{\tau}(B \mid A) \ge \frac{1}{2}$
(and hence, $\mathbb{P}_s^{\tau}(B) \ge \frac{1}{4}$), and for all $w \in B$ we have that the accumulated reward along $w$ reaches $4m$
before the $n$-th step. Thus with probability at least $\frac{1}{4}$, a run $w \in \mathit{Run}_Q(s)$ satisfies $\inf_{i \ge 0} r[i](w) > -m$ and
$\max_{0 \le i \le n} r[i](w) \ge 4m$.

We denote by $T_s(w)$ the stopping time over $\mathit{Run}_Q(s)$ which for every $w \in \mathit{Run}_Q(s)$ returns the least
number $i \ge 0$ such that either $r[i](w) \notin (-m, 4m)$, or $i = n$. Observe that the expected accumulated reward
at the stopping time $T_s$ is at least $\frac{1}{4} \cdot 4m + \frac{3}{4}(-m) = \frac{m}{4} > 0$. Let us define a new strategy $\sigma$ as follows.
Starting in a state $s \in V$, the strategy $\sigma$ chooses the same transitions as $\tau$ started in $s$, up to the stopping time
$T_s$. Once the stopping time is reached, say in a state $v$, the strategy $\sigma$ erases its memory and behaves like $\tau$
started anew in $v$. Subsequently, $\sigma$ follows the behavior of $\tau$ up to the stopping time $T_v$. Once the stopping
time $T_v$ is reached, say in a state $u$, $\sigma$ erases its memory and starts to behave as $\tau$ started anew in $u$, and so
on. Observe that the strategy $\sigma$ uses only finite memory because each stopping time $T_s$ is bounded for every
state $s$. Because $\tau$ is pure, so is $\sigma$.
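
The lower bound on the expected accumulated reward at the stopping time can be spelled out as follows. This is a sketch under the reading that the set $B$ of runs reaching $4m$ before step $n$ has probability at least $\frac{1}{4}$, and that, since rewards lie in $\{-1, 0, 1\}$, the accumulated reward at the stopping time is never below $-m$; the symbol $\mathbb{E}$ for the expectation under $\tau$ from $s$ is introduced here only for illustration.

```latex
\[
  \mathbb{E}\bigl[\, r[T_s] \,\bigr]
  \;\ge\; \mathbb{P}_s^{\tau}(B) \cdot 4m + \bigl(1 - \mathbb{P}_s^{\tau}(B)\bigr) \cdot (-m)
  \;=\; 5m \cdot \mathbb{P}_s^{\tau}(B) - m
  \;\ge\; \tfrac{5m}{4} - m
  \;=\; \tfrac{m}{4} \;>\; 0 .
\]
```

The first inequality uses that runs in $B$ stop with accumulated reward $4m$ and all other runs stop with accumulated reward at least $-m$; the second uses $m > 0$.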

Now we argue that $\sigma$ is optimal. Intuitively, this is because, on average, the accumulated reward strictly
increases between resets of the memory of $\sigma$. To formally argue that this implies that the accumulated reward
increases indefinitely, we employ the theory of random walks on $\mathbb{Z}$ and sums of i.i.d. random variables (see,
e.g., Chapter 8 of [4]). Essentially, we define a set of random walks, one for each state $s$, capturing the
sequence of changes to the accumulated reward between each reset in $s$ and the next reset (in any state).
We can then apply random walk results, e.g., from [4, Chapter 8], to conclude that these walks diverge to $\infty$
almost surely. For details see Lemmas 11 and 10 in the appendix.

Taking the product of the finite-memory strategy $\sigma$ and Q yields a finite-state Markov chain. By analyzing
its bottom strongly connected components we can eliminate the use of memory, and obtain a pure and
memoryless optimal strategy, see Lemma 12 in the appendix.

$\mathit{LimInf}(> -\infty)$: Like $\mathit{LimInf}(= +\infty)$, the objective $\mathit{LimInf}(> -\infty)$ is tail, but not submixing. Thus there is
always a pure optimal strategy, $\tau$, for $\mathit{LimInf}(> -\infty)$, by [11, Theorem 3.1], but Theorem 1 of [10] does
not apply. We will prove Proposition 1 for $\mathit{LimInf}(> -\infty)$ using the results for $\mathit{LimInf}(= +\infty)$, and also a
new objective, $\mathit{All}(\ge 0) := \{ w \in \mathit{Run}_Q \mid \forall n \ge 0 : \sum_{j=0}^{n} r(w(j)) \ge 0 \}$. Let $W_\infty$ and $W_+$ denote the sets of
states $s$ such that $\mathit{Val}^{\mathit{LimInf}(= +\infty)}(s) = 1$, and $\mathit{Val}^{\mathit{All}(\ge 0)}(s) = 1$, respectively. Then, as we prove in the appendix,
LemmaQj] for every state, $s$, with $\mathit{Val}^{\mathit{LimInf}(> -\infty)}(s) = 1$:

\[ \exists \sigma : \mathbb{P}_s^{\sigma}(\mathit{Reach}(W_\infty \cup W_+)) = 1 \tag{1} \]

Moreover, we prove that whenever $\mathit{Val}^{\mathit{All}(\ge 0)}(s) = 1$ then Max has a pure and memoryless strategy $\sigma_+$ which
is optimal in $s$ for $\mathit{All}(\ge 0)$. Indeed, observe that player Max achieves $\mathit{All}(\ge 0)$ with probability 1 iff all
runs satisfy it. So we may consider the MDP Q as a 2-player non-stochastic game, where random nodes
are now treated as player Min's. In this case, Theorem 12 of [3] guarantees the existence of the promised
strategy $\sigma_+$. The proof is now finished by observing that, by Fact 3, there is a pure and memoryless strategy
$\sigma$ maximizing the probability of reaching $W_\infty \cup W_+$. The resulting pure and memoryless strategy,
optimal for $\mathit{LimInf}(> -\infty)$, can be obtained by "stitching" $\sigma$ together with the respective optimal strategies for
$\mathit{LimInf}(= +\infty)$ and $\mathit{All}(\ge 0)$. □

The following simple lemma is proved in the appendix. 

Lemma 1. Let M be a finite, strongly connected (irreducible) Markov chain, and $O$ be a tail objective. Then
there is $x \in \{0, 1\}$ such that $\mathbb{P}_s(O) = x$ for all states $s$.

A corollary of the previous proposition and lemma is the following:

Proposition 2. Let $O \in \{\mathit{LimInf}(= -\infty), \mathit{LimInf}(= +\infty), \mathit{Mean}(> 0)\}$. Then in every SSG, and for all states,
$s$, $\mathit{Val}^O(s)$ is rational, with a polynomial length binary encoding.

Proof. By Proposition 1 there are memoryless optimal strategies: $\sigma$ for Max, and $\pi$ for Min. Fixing them
induces a Markov chain on the states of Q. By Lemma 1, in every fixed bottom strongly connected component
(BSCC), $C$, of this finite-state Markov chain, all states $v \in C$ have the same value, $x_C$, which is either 0 or 1.
Denote by $W$ the union of all BSCCs, $C$, with $x_C = 1$. By optimality of $\sigma$ and $\pi$, $\mathit{Val}^O(s) = \mathbb{P}_s^{\sigma,\pi}(\mathit{Reach}(W))$
for every $s \in V$. By, e.g., [6, Section 3], this probability is rational, with polynomial length bit encoding,
since reaching $W$ is a regular event, and every Markov chain is a special case of an MDP. □
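
The two ingredients of this proof, finding the BSCCs of the Markov chain induced by fixed memoryless strategies and computing an exactly rational reachability probability, can be sketched in Python as below. This is an illustrative sketch only: the function names and the four-state chain are hypothetical, and exact Gauss-Jordan elimination over `Fraction` stands in for whichever method a polynomial-time implementation would use.

```python
from fractions import Fraction


def reachable(chain, start):
    """States reachable from `start` (itself included) via positive-probability edges."""
    seen, stack = {start}, [start]
    while stack:
        for u in chain[stack.pop()]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def bsccs(chain):
    """Bottom SCCs: sets C such that every state of C reaches exactly C."""
    reach = {v: reachable(chain, v) for v in chain}
    found = []
    for v in chain:
        comp = reach[v]
        if all(reach[u] == comp for u in comp) and comp not in found:
            found.append(comp)
    return found


def reach_probability(chain, target):
    """Exact probability of reaching the set `target` from each state."""
    reach = {v: reachable(chain, v) for v in chain}
    # States outside the target that cannot reach it get probability 0.
    unknown = [v for v in chain if v not in target and reach[v] & target]
    index = {v: i for i, v in enumerate(unknown)}
    n = len(unknown)
    # Augmented matrix of x_v - sum_u P(v,u) x_u = sum_{t in target} P(v,t).
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for v in unknown:
        i = index[v]
        rows[i][i] += 1
        for u, p in chain[v].items():
            if u in target:
                rows[i][n] += p
            elif u in index:
                rows[i][index[u]] -= p
    # Gauss-Jordan elimination over the rationals.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    result = {v: Fraction(int(v in target)) for v in chain}
    for v in unknown:
        result[v] = rows[index[v]][n]
    return result


# Hypothetical induced chain with two absorbing BSCCs, {"a"} and {"c"}.
chain = {
    "s": {"a": Fraction(1, 3), "b": Fraction(2, 3)},
    "a": {"a": Fraction(1)},
    "b": {"s": Fraction(1, 2), "c": Fraction(1, 2)},
    "c": {"c": Fraction(1)},
}
values = reach_probability(chain, {"a"})
```

If the BSCC {"a"} is taken to be the only one of value 1 (an assumption of this example), the value in "s" is the rational number 1/2 and the value in "b" is 1/4.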

Proof of Theorem 2. We will need a couple of preliminary lemmas:

Lemma 2. Let Q be an MDP with rewards, and $s$ a state of Q. Then for every memoryless strategy $\sigma$:

\[ \mathbb{P}_s^{\sigma}(\mathit{Mean}(> 0)) = \mathbb{P}_s^{\sigma}(\mathit{LimInf}(= +\infty)) \]

In particular, both objectives are equivalent with respect to both the value and optimal strategies. 

Proof. (Sketch.) The inequality $\le$ is true for all strategies, since $\mathit{Mean}(> 0) \subseteq \mathit{LimInf}(= +\infty)$. In the other
direction, the property that $\sigma$ is memoryless is needed, so that fixing $\sigma$ yields a Markov chain on the
states of Q. In this Markov chain, by Lemma 1, for every BSCC, $C$, there are $x_C \le y_C \in \{0, 1\}$, such
that $\mathbb{P}_s^{\sigma}(\mathit{Mean}(> 0) \mid \mathit{Reach}(C)) = x_C$, and $\mathbb{P}_s^{\sigma}(\mathit{LimInf}(= +\infty) \mid \mathit{Reach}(C)) = y_C$. By random walk arguments,
considering the rewards accumulated between subsequent visits to a fixed state in $C$, we can prove that
$y_C = 1 \implies x_C = 1$, see LemmafPflin the appendix. Proposition 1 finishes the proof. □

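Lemma 3. For an objective $O \in \{\mathit{LimInf}(= -\infty), \mathit{LimInf}(> -\infty), \mathit{LimInf}(= +\infty), \mathit{LimInf}(< +\infty)\}$, and a
maximizing MDP, Q, denote by $W$ the set of all $s \in V$ satisfying $\mathit{Val}^O(s) = 1$. Then $\mathit{Val}^O(s) = \mathit{Val}^{\mathit{Reach}(W)}(s)$
for every state $s$.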

Proof. Proposition 1 gives us a memoryless optimal strategy, $\sigma$. By fixing it, we obtain a Markov chain on
states of Q. We denote by $W'$ the union of all BSCCs of this Markov chain, in which at least one state has a
positive value. By Lemma 1, all states from $W'$ have, in fact, value 1. Since $W' \subseteq W$, and $\sigma$ is optimal, we
get

\[ \mathit{Val}^O(s) = \mathbb{P}_s^{\sigma}(O) = \mathbb{P}_s^{\sigma}(\mathit{Reach}(W')) \le \mathbb{P}_s^{\sigma}(\mathit{Reach}(W)) \le \mathit{Val}^{\mathit{Reach}(W)}(s) \]

for every state $s$. Because $O$ is a tail objective, we easily obtain $\mathit{Val}^O(s) \ge \mathit{Val}^{\mathit{Reach}(W)}(s)$. □

To prove Theorem 2, we start with the MDP case. By Proposition 1, pure memoryless strategies are
sufficient for optimizing the probability of all the objectives considered in this theorem, so we can restrict
ourselves to such strategies for this proof. Given an objective $O$, we will write $W^O$ to denote the set of states
$s$ with $\mathit{Val}^O(s) = 1$. As Q is an MDP, optimal strategies for reaching any state in $W^O$ can be computed in
polynomial time, by Fact 3. If $O$ is any of the objectives mentioned in the statement of Lemma 3, then by
that Lemma, in order to compute optimal strategies and values for objective $O$, it suffices to compute the set
$W^O$ and optimal strategies for the objective $O$ in states in $W^O$. The resulting optimal strategy "stitches" these
and the optimal strategy for reaching $W^O$.

Proposition3. For every MDP, Q, and an objective O — Limlnfi— -°°X Limlnfi— +°°), or Meani> 0), the 
problem whether s e W° is decidable in P-time. If s € W , then a strategy optimal in s is computable in 
P-time. 
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Proof. (Sketch.) From Lemma[2]we know that Liminf (- +00) is equivalent to Mean(> 0), and thus we only 
have to consider O = Liminf {- -00) and O = Mean(> 0). For a uniform presentation, we assume that Q is a 
maximizing MDP, and consider two cases: O = Mean{> 0), and Liminf (> -00). The remaining cases were 
solved in - Theorem 3.1 there solves the case O = Limlnfi- —°°), and Section 3.3 solves O = Mean(< 0). 
O - Mean{> 0): We design an algorithm to decide whether maxo- F^(Mean(> 0)) = 1, using the existing 
polynomial time algorithm, based on linear programming, for maximizing the expected mean payoff and 
computing optimal strategies for it (see, e.g., US]). Note that, as shown in the appendix (Lemma|7]i, it does 
not matter whether liminf or limsup is used in the definition of Mean{> 0). Under a memoryless strategy 
cr, almost all runs in Q reach one of the bottom strongly connected components (BSCCs). Almost all runs 
initiated in some BSCC, C, visit all states of C infinitely often, and it follows from standard Markov chain 
theory (e.g., |[T4l ) that almost all runs in C have the same mean payoff, which equals the expected mean 
payoff for the Markov chain induced by C. 



Procedure MP(s)
Data: A state $s$.
Result: Decide $\mathit{Val}^{\mathit{Mean}(>0)}(s) = 1$. If yes, return a strategy $\sigma$ with $\mathbb{P}^{\sigma}_s(\mathit{Mean}(>0)) = 1$.

1  repeat
2      Compute a strategy $\sigma_{mp}$ maximizing the expected mean payoff.
3      if $\mathbb{E}^{\sigma_{mp}}_s(\text{mean payoff}) \leq 0$ then return No
4      Fix $\sigma_{mp}$ to get a Markov chain on $\mathcal{G}$. Find a BSCC, $C$, with mean payoff almost surely positive.
5      Compute a strategy $\sigma_C$ maximizing the probability of $\mathit{Reach}(C)$.
6      foreach $v$ with $\mathbb{P}^{\sigma_C}_v(\mathit{Reach}(C)) = 1$ do
7          Remove state $v$.
8          if $v \in C$ then $\sigma(v) \leftarrow \sigma_{mp}(v)$ else $\sigma(v) \leftarrow \sigma_C(v)$
9  until $s$ is cut off
10 return (Yes, $\sigma$)



The algorithm is given here as Procedure MP(s). Both step 2 and verifying the condition from
step 4 can be done in P-time, because, as observed above, the latter is equivalent to verifying that the expected
mean payoff in $C$ is positive, which can be done in P-time (see [15, Theorem 9.3.8]). Step 5 can be done in
P-time by Fact 3. To obtain a formally correct MDP, we introduce a new state $z$ with a self-loop, and after the
removal of any state $v$ in step 7 of the for loop, we redirect all stochastic transitions leading to $v$ to this new
state $z$, and eliminate all other transitions into $v$. The reward of the new state $z$ is set to 0. This will not affect
the sign of subsequent optimal expected mean payoffs starting from $s$, unless $s$ has already been removed.
Thus, the algorithm can be implemented so that each iteration of the repeat-loop takes P-time, and so the
algorithm terminates in P-time, since in each iteration at least one state must be removed. If the algorithm
outputs (Yes, $\sigma$) then clearly $\mathbb{P}^{\sigma}_s(\mathit{Mean}(>0)) = 1$. On the other hand, by an easy induction on the number
of iterations of the repeat-loop one can prove that if $\mathit{Val}^{\mathit{Mean}(>0)}(s) = 1$ then the following is an invariant of
line 9: either $s$ has been removed, or the maximal expected mean payoff starting in $s$ is positive. In particular,
the algorithm cannot output No. Thus we have completed the case when $O = \mathit{Mean}(>0)$.

$O = \mathit{LimInf}(>-\infty)$: Recall first the auxiliary objective $\mathit{All}(\geq 0) := \{w \in \mathit{Run}_{\mathcal{G}} \mid \forall n \geq 0 : \sum_{j=0}^{n} r(w(j)) \geq 0\}$
from the proof of Proposition 1, and also the sets $W_{\infty} = \{v \mid \mathit{Val}^{\mathit{LimInf}(=+\infty)}(v) = 1\}$ and
$W_{+} = \{v \mid \mathit{Val}^{\mathit{All}(\geq 0)}(v) = 1\}$. Note that $W_{\infty} = W^{\mathit{Mean}(>0)}$, by Lemma 2. Finally, recall from equation (1) in the proof
of Proposition 1 that the probability of $\mathit{LimInf}(>-\infty)$ is maximized by almost surely reaching $W_{\infty} \cup W_{+}$
and then satisfying $\mathit{All}(\geq 0)$ or $\mathit{LimInf}(=+\infty)$. We note that the strategy $\sigma_{+}$, optimal for $\mathit{All}(\geq 0)$, from the
proof of Proposition 1 can be computed in polynomial time by [3, Theorem 12]. The results on $\mathit{Mean}(>0)$
and Fact 3 conclude the proof. □



Now we finish the proof of Theorem 2. Proposition 3 and Fact 3 together establish the MDP case.
Establishing the NP $\cap$ coNP upper bound for SSGs proceeds in a standard way: guess a strategy for one
player, fix it to get an MDP, and verify in polynomial time (Proposition 3) that the other player cannot do
better than the given value $p$. To decide whether, e.g., $\mathit{Val}^O(s) \geq p$, guess a strategy $\sigma$ for Max, fix it to get
an MDP, and verify that Min has no strategy $\pi$ so that $\mathbb{P}^{\sigma,\pi}_s(O) < p$. Other cases are similar. □

Finally, we show that the upper bound from Theorem 2 is hard to improve upon:

Proposition 4. Assume that an SSG, $\mathcal{G}$, a state $s$, and a reward function $r$ are given, and let $O$ be an objective
from $\{\mathit{LimInf}(=-\infty), \mathit{LimInf}(=+\infty), \mathit{Mean}(>0)\}$. Moreover, assume the property (promise) that either
$\mathit{Val}^O(s) = 1$ or $\mathit{Val}^O(s) = 0$. Then deciding which is the case is at least as hard as Condon's [5] quantitative
reachability problem w.r.t. polynomial time reductions.

Proof. The problem studied by Condon [5] is: given an SSG, $\mathcal{H}$, an initial state $s$, and a target state $t$, decide
whether $\mathit{Val}^{\mathit{Reach}(t)}(s) \geq 1/2$. Deciding whether $\mathit{Val}^{\mathit{Reach}(t)}(s) > 1/2$ is P-time equivalent. Moreover, we
may safely assume there is a state $t' \neq t$ such that, whatever strategies are employed, we reach $t$ or $t'$ with
probability 1. Consider the following reduction: given an SSG, $\mathcal{H}$, with distinguished states $s$, $t$, and $t'$ as
above, produce a new SSG, $\mathcal{G}$, with rewards as follows: remove all outgoing transitions from $t$ and $t'$, add
transitions $t \to s$ and $t' \to s$, and make both $t$ and $t'$ belong to Max. Let $r$ be the reward function over states
of $\mathcal{G}$, defined as follows: $r(t) := -1$, $r(t') := +1$ and $r(z) := 0$ for all other $z \notin \{t, t'\}$. It follows from basic
random walk theory that in $\mathcal{G}$, $\mathit{Val}^{\mathit{LimInf}(=-\infty)}(s) = 1$ if $\mathit{Val}^{\mathit{Reach}(t)}(s) \geq 1/2$, and $\mathit{Val}^{\mathit{LimInf}(=-\infty)}(s) = 0$ otherwise.
Likewise, $\mathit{Val}^{\mathit{LimInf}(=+\infty)}(s) = 1$ if $\mathit{Val}^{\mathit{Reach}(t')}(s) > 1/2$, and $\mathit{Val}^{\mathit{LimInf}(=+\infty)}(s) = 0$ otherwise, and identically for
the objective $\mathit{Mean}(>0)$, which we already showed to be equivalent to $\mathit{LimInf}(=+\infty)$. □
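
The "basic random walk theory" invoked in this proof can be illustrated on the simplest case. The following is a minimal sketch, not part of the original argument: it assumes that every round trip from s adds +1 to the accumulated reward with a fixed probability `p_up` and -1 otherwise, so that the accumulated reward is a simple random walk. The function name `prob_ever_drops_by` is hypothetical, and the formula used is the standard gambler's-ruin result.

```python
from fractions import Fraction

def prob_ever_drops_by(j, p_up):
    """Probability that a simple random walk started at 0, stepping +1 with
    probability p_up and -1 otherwise, ever reaches level -j.

    Standard gambler's-ruin formula: 1 if the drift is zero or negative,
    and ((1 - p_up) / p_up) ** j if the drift is positive.
    """
    p_down = 1 - p_up
    if p_up <= p_down:
        # Zero or negative drift: every level below 0 is reached almost surely.
        return Fraction(1)
    return (p_down / p_up) ** j

# Fair walk: every negative level is hit almost surely.
fair = prob_ever_drops_by(1, Fraction(1, 2))
# Upward drift: the walk escapes to +infinity with positive probability.
biased = prob_ever_drops_by(2, Fraction(2, 3))
```

Under these assumptions, the walk reaches every negative level almost surely exactly when `p_up` is at most 1/2, which matches the threshold 1/2 in the reduction.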

4 Termination 

In this section we prove Theorem 1. We continue viewing OC-SSGs as finite-state SSGs with rewards,
as discussed in the introduction. However, for notational convenience this time we consider rewards on
transitions rather than on states. It is easy to observe that Theorem 2 remains valid even if we sum rewards
on transitions instead of rewards on states in the definition of $\mathit{LimInf}(=-\infty)$. We fix an SSG, $\mathcal{G}$, with state set
$V$, and a reward function $r$.

Lemma 4. Assume that $j > |V|$. Then for all states $s$: $\mathit{Val}^{\mathit{Term}(j)}(s) = 1$ iff $\mathit{Val}^{\mathit{LimInf}(=-\infty)}(s) = 1$.

Proof. If $\mathcal{G}$ is a maximizing MDP, the lemma is true by results of [2, Section 4]. Consider now the
general case, when $\mathcal{G}$ is an SSG. If $\mathit{Val}^{\mathit{LimInf}(=-\infty)}(s) = 1$ then clearly $\mathit{Val}^{\mathit{Term}(j)}(s) = 1$. Now assume that
$\mathit{Val}^{\mathit{Term}(j)}(s) = 1$ and consider the memoryless strategy of player Min, optimal for $\mathit{LimInf}(=-\infty)$, which
exists by Proposition 1. Fixing it, we get a maximizing MDP, in which the value of $\mathit{Term}(j)$ in $s$ is, of course,
still 1. We already know from the above discussion that the value of $\mathit{LimInf}(=-\infty)$ in $s$ is thus also 1 in
this MDP. Since the fixed strategy for Min was optimal, we get that $\mathit{Val}^{\mathit{LimInf}(=-\infty)}(s) = 1$ in $\mathcal{G}$. Thus, if
$\mathit{Val}^{\mathit{Term}(j)}(s) = 1$ then $\mathit{Val}^{\mathit{LimInf}(=-\infty)}(s) = 1$. □

Proof of Theorem 1. For cases where $j > |V|$, the theorem follows directly from Lemma 4 and Theorem 2.
If $j \leq |V|$ then we have to perform a simple reachability analysis, similar to the one presented in [2]. The
following SSG, $\mathcal{G}'$, keeps track of the accumulated rewards as long as they are between $-j$ and $|V| - j$: its
set of states is $V' := \{(u, i) \mid u \in V, -j \leq i \leq |V| - j\}$.

States $(u, i)$ with $i \in \{-j, |V| - j\}$ are absorbing, and for $i \notin \{-j, |V| - j\}$ we have $(u, i) \to (t, k)$ iff $u \to t$
and $k = i + r(u \to t)$. Every $(u, i)$ belongs to the player who owned $u$. The probability of every transition
$(u, i) \to (t, k)$, $u \in V_P$, is the same as that of $u \to t$. There is no reward function for $\mathcal{G}'$; we consider a reachability
objective instead, given by the target set $R := \{(u, -j) \mid u \in V\} \cup \{(u, i) \mid -j \leq i \leq |V| - j, \mathit{Val}^{\mathit{LimInf}(=-\infty)}(u) = 1\}$.
Finally, let us observe that, by Lemma 4, $\mathit{Val}^{\mathit{Reach}(R)}((s, 0)) = 1$ iff $\mathit{Val}^{\mathit{Term}(j)}(s) = 1$. Since the size of $\mathcal{G}'$ is
polynomial in the size of $\mathcal{G}$, Theorem 1 is proved. □
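
The bounded-counter construction used in this proof can be sketched in code. This is an illustrative sketch under stated assumptions, not the authors' implementation: the game is assumed to be given as a dictionary mapping each state to a list of `(successor, reward, probability)` triples with rewards in {-1, 0, +1}, absorbing states are modeled by an empty successor list, and the set of states with LimInf(= -infinity) value 1 is assumed to be supplied by the caller. The names `build_counter_game` and `target_set` are hypothetical.

```python
def build_counter_game(transitions, j):
    """Build the transitions of the product game over pairs (u, i),
    where i is the accumulated reward and -j <= i <= |V| - j.

    transitions: dict mapping state u to a list of (t, reward, prob) triples;
    prob may be None for states controlled by a player (assumed encoding).
    """
    lo, hi = -j, len(transitions) - j
    product = {}
    for u, succs in transitions.items():
        for i in range(lo, hi + 1):
            if i in (lo, hi):
                # Boundary values of the accumulated reward are absorbing.
                product[(u, i)] = []
            else:
                # Rewards are in {-1, 0, +1}, so i + r stays within [lo, hi].
                product[(u, i)] = [((t, i + r), p) for (t, r, p) in succs]
    return product

def target_set(transitions, j, liminf_winning):
    """Target set R: accumulated reward -j, or a state of the original game
    whose LimInf(= -infinity) value is 1 (supplied in liminf_winning)."""
    lo, hi = -j, len(transitions) - j
    return {(u, i) for u in transitions for i in range(lo, hi + 1)
            if i == lo or u in liminf_winning}

# Tiny example with two states and j = 1.
game = {"a": [("b", -1, None)], "b": [("a", 0, None)]}
product = build_counter_game(game, 1)
targets = target_set(game, 1, liminf_winning={"b"})
```

The product has |V| * (|V| + 1) states, which is the polynomial size bound the proof relies on.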

Proposition 5. For all $j > 0$, $s \in V$, there are pure strategies, $\sigma$ for Max and $\pi$ for Min, such that

1. If $\mathit{Val}^{\mathit{Term}(j)}(s) = 1$ then $\sigma$ is optimal in $s$ for $\mathit{Term}(j)$.

2. If $\mathit{Val}^{\mathit{Term}(j)}(s) < 1$ then $\sup_{\tau} \mathbb{P}^{\tau,\pi}_s(\mathit{Term}(j)) < 1$.

Moreover, $\sigma$ is memoryless, and $\pi$ only uses memory of size $|V|$. Such strategies can be computed in P-time
for MDPs.

The proof goes along the lines of the proof of Theorem 1. It can be found in the appendix, Section A.6,
together with an example that shows the memory use in $\pi$ is necessary.

Similarly, both $\mathit{Val}^{\mathit{Term}(j)}(s) = 0$ and $\mathit{Val}^{\mathit{Term}(j)}(s) > 0$ are witnessed by pure and memoryless strategies
for the respective players. Deciding which is the case is in P-time, by assigning the random states to player
Max, obtaining a non-stochastic 2-player one-counter game, and using, e.g., [3, Theorem 12]. Finally, we
note that from Proposition 4 and Lemma 4 it follows that:

Corollary 1. Given an SSG, $\mathcal{G}$, and reward function $r$, deciding whether the value of the termination
objective $\mathit{Term}(j)$ equals 1 is at least as hard as Condon's [5] quantitative reachability problem, w.r.t. P-time
many-one reductions.

A Appendix

In the entire appendix, when referring to MDPs and SSGs, we mean finite-state MDPs and SSGs.

A.1 Proof of Proposition 1, objectives $\mathit{LimInf}(<+\infty)$, $\mathit{Mean}(>0)$ and $\mathit{Mean}(\leq 0)$

An objective $O$ is submixing if for every run $w = u_1 v_1 u_2 v_2 \cdots u_k v_k \cdots$, such that $u_i$ and $v_i$ are finite paths
for every $i$, and such that both $u = u_1 u_2 \cdots u_k \cdots$ and $v = v_1 v_2 \cdots v_k \cdots$ are also runs, we have
$w \in O \implies (u \in O \lor v \in O)$. This notion is taken directly from [10], where it has been defined in a more general setting.
(See also [2, Section 3] for more details.) By [10, Theorem 1], for every maximizing MDP and every tail
submixing objective, $O$, player Max has a pure and memoryless optimal strategy.

Lemma 5. The objective $\mathit{LimInf}(<+\infty)$ is a submixing and tail objective.

Proof. Obviously it is tail. As for the submixing property, let $\{a_i\}_{i=1}^{\infty}$ be a sequence of numbers, and consider
an arbitrary splitting of this sequence into two infinite subsequences $\{b_i\}_{i=1}^{\infty}$, $\{c_i\}_{i=1}^{\infty}$. For $x \in \{a, b, c\}$ we define

\[ L_x := \liminf_{n \to \infty} \sum_{i=1}^{n} x_i . \]

It is easy to verify that if at least one of $L_b$, $L_c$ is finite, or if they are infinite with the same sign, then
$L_a \geq L_b + L_c$. In particular, if $L_a < \infty$ then $\min\{L_b, L_c\} < \infty$. Applying this to the sequences of rewards
finishes the proof. □

By changing the lim inf to lim sup in the definition of $\mathit{Mean}(>0)$ we obtain a new objective:

\[ \mathit{Mean}(>0)^{+} := \Bigl\{ w \in \mathit{Run}_{\mathcal{G}} \Bigm| \limsup_{n \to \infty} \sum_{i=0}^{n-1} r(w(i))/n > 0 \Bigr\} . \]

Lemma 6. Both $\mathit{Mean}(>0)^{+}$ and $\mathit{Mean}(\leq 0)$ are tail and submixing.

Proof. Both are clearly tail. For the submixing property, let us start with $\mathit{Mean}(>0)^{+}$. Let $A = \{a_i\}_{i=1}^{\infty}$ be
a sequence of numbers, and consider an arbitrary splitting of this sequence into two infinite subsequences
$B = \{b_i\}_{i=1}^{\infty}$, $C = \{c_i\}_{i=1}^{\infty}$. For a fixed $n \geq 1$ denote by $n_b \leq n$ the number of elements of $B$ among the first $n$
elements of $A$. Then, assuming $0 < n_b < n$,

\[ \frac{\sum_{i=1}^{n} a_i}{n} = \frac{\sum_{i=1}^{n_b} b_i}{n_b} \cdot \frac{n_b}{n} + \frac{\sum_{i=1}^{n-n_b} c_i}{n - n_b} \cdot \Bigl(1 - \frac{n_b}{n}\Bigr) . \]

Consequently,

\[ \frac{\sum_{i=1}^{n} a_i}{n} \leq \max\Bigl\{ \frac{\sum_{i=1}^{n_b} b_i}{n_b}, \frac{\sum_{i=1}^{n-n_b} c_i}{n - n_b} \Bigr\} \]

and thus there is $x \in \{b, c\}$ such that

\[ \limsup_{n \to \infty} \frac{\sum_{i=1}^{n} a_i}{n} \leq \limsup_{n \to \infty} \frac{\sum_{i=1}^{n} x_i}{n} . \]

The proof for $\mathit{Mean}(\leq 0)$ proceeds similarly, only with reversed signs. □

Now we show that $\mathit{Mean}(>0)$ is equivalent to $\mathit{Mean}(>0)^{+}$ for memoryless strategies.

Lemma 7. Under a memoryless strategy, $\sigma$, for an MDP, $\mathcal{G}$, with a reward function, $r$, for almost all runs, $w$:

\[ \liminf_{n \to \infty} \sum_{i=0}^{n-1} r(w(i))/n = \limsup_{n \to \infty} \sum_{i=0}^{n-1} r(w(i))/n . \]

Proof. Fix $\sigma$ to get a Markov chain on the states of $\mathcal{G}$. Almost all runs visit some bottom strongly connected
component (BSCC), and the above equality establishes a prefix independent property. We thus safely assume
that $w$ starts in a BSCC, $C$. On $C$, $\sigma$ induces an irreducible Markov chain, and applying the Ergodic theorem
(see Theorem 1.10.2 from [14]) finishes the proof. □

Lemma 8. For every maximizing MDP, there is always a pure and memoryless strategy, $\sigma$, optimal for
$\mathit{Mean}(>0)$.

Proof. Choose $\sigma$ to be optimal for $\mathit{Mean}(>0)^{+}$. This is possible, because $\mathit{Mean}(>0)^{+}$ is a submixing and
tail objective. Observe that since $\mathit{Mean}(>0) \subseteq \mathit{Mean}(>0)^{+}$, we have $\mathit{Val}^{\mathit{Mean}(>0)}(s) \leq \mathit{Val}^{\mathit{Mean}(>0)^{+}}(s)$ for all
states $s$. Finally, due to Lemma 7, for all states $s$:

\[ \mathit{Val}^{\mathit{Mean}(>0)^{+}}(s) = \mathbb{P}^{\sigma}_s(\mathit{Mean}(>0)^{+}) = \mathbb{P}^{\sigma}_s(\mathit{Mean}(>0)) \leq \mathit{Val}^{\mathit{Mean}(>0)}(s) . \]

□
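
The averaging step in the proof of Lemma 6 (the average of a prefix of A is a convex combination of the averages of the corresponding prefixes of B and C) can be checked on concrete data. The following is a small sketch with exact rational arithmetic; the function name `split_averages` and the sample sequence are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction

def split_averages(a, in_b):
    """Split the finite prefix a into B (where in_b is True) and C (otherwise).

    Returns (average of a, average of B, average of C, n_b / n).
    Assumes both parts are non-empty, i.e. 0 < n_b < n.
    """
    b = [x for x, flag in zip(a, in_b) if flag]
    c = [x for x, flag in zip(a, in_b) if not flag]

    def avg(xs):
        return Fraction(sum(xs), len(xs))

    return avg(a), avg(b), avg(c), Fraction(len(b), len(a))

prefix = [1, -1, 1, 1, -1, 0]
membership = [True, False, True, False, False, True]
avg_a, avg_b, avg_c, weight = split_averages(prefix, membership)

# Convex-combination identity and the resulting upper bound.
identity_holds = avg_a == avg_b * weight + avg_c * (1 - weight)
bound_holds = avg_a <= max(avg_b, avg_c)
```

Because the weights `weight` and `1 - weight` are non-negative and sum to 1, the bound by the larger of the two averages follows for every prefix, which is what the lim sup argument uses.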

One may be tempted to believe that all of the objectives we study are submixing. This is, however, not
true for $\mathit{LimInf}(=+\infty)$ and $\mathit{LimInf}(>-\infty)$, where we have to employ other methods for proving the existence
of pure and memoryless optimal strategies.

Lemma 9. The objectives $\mathit{LimInf}(=+\infty)$ and $\mathit{LimInf}(>-\infty)$ are not submixing.

Proof. Consider the following finite sequences $A_k$ over $\{+1, -1\}$, parametrized by $k \geq 1$, and defined inductively
by $A_1 := +1, -1$, and $A_{k+1} := +1, A_k, -1$. We build an infinite sequence $A = \{a_i\}_{i=1}^{\infty}$ by concatenating them,
$A := A_1, A_2, A_3, \ldots$. Obviously $\liminf_{n \to \infty} \sum_{i=1}^{n} a_i = 0$. Now we define two particular subsequences of $A$, denoted
by $B := \{b_i\}_{i=1}^{\infty}$, $C := \{c_i\}_{i=1}^{\infty}$, so that

\[ \liminf_{n \to \infty} \sum_{i=1}^{n} b_i = \liminf_{n \to \infty} \sum_{i=1}^{n} c_i = -\infty . \qquad (2) \]

We do it inductively by saying for every $k \geq 1$ whether the $k$-th element, $a_k$, of $A$ belongs to $B$ or $C$. Assume
we have already decided for each of the first $k$ elements of $A$ whether it belongs to $B$ or $C$, so that we have
already defined the finite prefixes $b_1, \ldots, b_M$ of $B$, and $c_1, \ldots, c_N$ of $C$. Set $s^i_B := \sum_{j=1}^{i} b_j$, and similarly
$s^i_C := \sum_{j=1}^{i} c_j$. If either $a_{k+1} = -1$ and $\min_{i \leq M} s^i_B \geq \min_{i \leq N} s^i_C$, or $a_{k+1} = +1$ and $\min_{i \leq M} s^i_B < \min_{i \leq N} s^i_C$, then
$a_{k+1}$ belongs to $B$, otherwise it belongs to $C$. It is easy to verify that for every number $m$ we have some $n$ such
that $s^n_B < m$, and some $n'$ such that $s^{n'}_C < m$. (In fact, this is the idea behind the construction: the sequences
$B$ and $C$ take turns in achieving lower and lower partial sums.) Thus (2) is true. As the sequence $A$ can be
easily obtained as a sequence of rewards associated to a run of a very simple MDP with rewards, this proves
that $\mathit{LimInf}(>-\infty)$ is not submixing.

The proof that $\mathit{LimInf}(=+\infty)$ is not submixing goes similarly. Along the lines of the previous proof,
just consider the following modifications: Take the sequence $A = \{a_i\}_{i=1}^{\infty}$ to be defined by $a_i = -1$ iff $i \equiv 0$
(mod 3), and $a_i = +1$ otherwise. Further, in the inductive process of building the sequences $B$ and $C$, denote
by $z_B := |\{i \leq M \mid s^i_B = 0\}|$, and by $z_C := |\{i \leq N \mid s^i_C = 0\}|$. Finally, apply the rule of assigning $a_{k+1}$ to $B$ iff
either $a_{k+1} = -1$, $z_C \geq z_B$, and $s^M_B > 0$, or $a_{k+1} = +1$ and $z_C < z_B$ or $s^M_B = 0$. (Here the intuition is that $B$ and
$C$ take turns in revisiting 0 from above.) It is easy to show that for every $m \geq 0$ there is some $n > m$ such that
$s^n_B = 0$, and some $n' > m$ such that $s^{n'}_C = 0$. This shows that $\liminf B = \liminf C = 0$, while $\liminf A = \infty$.
As a consequence, $\mathit{LimInf}(=+\infty)$ is not submixing. □
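
The first construction in the proof of Lemma 9 (the splitting that shows $\mathit{LimInf}(>-\infty)$ is not submixing) can be simulated directly. The sketch below is illustrative only and rests on two assumptions that the text leaves open: the empty prefix counts as partial sum 0 when taking running minima, and ties between the two minima send a -1 to B and a +1 to C. The function name `split_blocks` is hypothetical.

```python
def split_blocks(num_blocks):
    """Split A = A_1, A_2, ... into B and C, where A_k consists of k times +1
    followed by k times -1 (the unfolded form of A_{k+1} := +1, A_k, -1).

    Returns the minimal partial sums reached by A, B and C, respectively.
    """
    s_a = s_b = s_c = 0          # current partial sums
    min_a = min_b = min_c = 0    # running minima (empty prefix counts as 0)
    for k in range(1, num_blocks + 1):
        for a in [+1] * k + [-1] * k:
            s_a += a
            min_a = min(min_a, s_a)
            # A -1 goes to the sequence whose minimum is currently not lower;
            # a +1 goes to the sequence whose minimum is strictly lower.
            to_b = (min_b >= min_c) if a == -1 else (min_b < min_c)
            if to_b:
                s_b += a
                min_b = min(min_b, s_b)
            else:
                s_c += a
                min_c = min(min_c, s_c)
    return min_a, min_b, min_c

minima = split_blocks(40)
```

Under these assumptions the partial sums of A never drop below 0, while the running minima of B and C decrease in alternation, each by one every two blocks, so both reach -20 after 40 blocks.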



A.2 Proof of Proposition 1, objective $\mathit{LimInf}(=+\infty)$

First we set up a tool to analyze finite-state Markov chains with respect to the objective $\mathit{LimInf}(=+\infty)$.
Consider a finite-state Markov chain, $\mathcal{M}$, with the underlying transition graph $(S, \to)$, and with a reward
function, $r : S \to \{-1, 0, +1\}$. Assume, moreover, that $\mathcal{M}$ is irreducible. Also assume that some initial state,
$s$, is fixed. We derive here one condition sufficient for $\mathbb{P}_s(\mathit{LimInf}(=+\infty)) = 1$, and another one sufficient for
$\mathbb{P}_s(\mathit{LimInf}(=+\infty)) = 0$ in $\mathcal{M}$. The conditions are parametrized by a choice of a subset $R \subseteq S$ of the states of
$\mathcal{M}$. To formulate them we need the following random variables.

- $V^t_k$, $k > 0$, $t \in R$, returns the time of the $k$-th visit (thus "V") to $t$.

- $G^t_k$, $k > 0$, $t \in R$, is the reward gained ("G") between time $V^t_k$ (inclusive) and the next visit to $R$ (exclusive).

By standard facts from probability theory, almost all runs in $\mathcal{M}$ visit all states infinitely often. Thus these
random variables are almost surely defined. For a fixed $t \in R$, all the variables $G^t_k$ are i.i.d., and, as the
expected time to visit $R$ from $t$ is finite, their common mean, $\mu_t$, is well defined and finite. Observe also that
the values $\mu_t$ do not depend on the choice of the initial state.

Lemma 10. For every finite-state irreducible Markov chain, $\mathcal{M}$, every subset, $R$, of states, and every
$t \in R$, considering the numbers $\mu_s$, $s \in R$, derived as above, the following is true:

- If $\mu_s > 0$ for all $s \in R$ then $\mathbb{P}_t(\mathit{LimInf}(=+\infty)) = 1$.

- If $\mu_s \leq 0$ for all $s \in R$ then $\mathbb{P}_t(\mathit{LimInf}(=+\infty)) = 0$.

Proof. We use the following random variables on runs from $\mathit{Run}_{\mathcal{M}}(t)$:

- $V_k$, $k \geq 1$, the time of the $k$-th visit to $t$. (Note: $V_1 = 0$.)

- $A_k$, $k \geq 1$, the reward accumulated ("A") between time $V_k$ (inclusive) and $V_{k+1}$ (exclusive).

- $S_m := \sum_{k=1}^{m} A_k$, $m \geq 0$. ("S" for "sum". Note: $S_0 = 0$.)

Since $\mathcal{M}$ is a Markov chain, we get that the variables $A_k$ are i.i.d.; in particular there is some $\mu$ such that
$\mu = \mathbb{E}_t(A_k)$ for all $k \geq 1$. (By $\mathbb{E}$ we denote the expectation.)

Claim. If all $\mu_s > 0$ then $\mu > 0$. If all $\mu_s \leq 0$ then $\mu \leq 0$.

Proof. For every $s \in R$ and $w \in \mathit{Run}_{\mathcal{M}}(t)$ denote by $v_s(w)$ the number of visits to $s$ before the first revisit to
$t$: $v_s(w) = \mathrm{card}(\{k \geq 0 \mid w(k) = s \land \forall \ell \leq k : w(\ell) = t \implies \ell = 0\})$. Then, writing $R = \{t_1, \ldots, t_\ell\}$,

\[ \mu = \sum_{c_1, \ldots, c_\ell \geq 0} \mathbb{P}_t\Bigl( \bigwedge_{j=1}^{\ell} v_{t_j} = c_j \Bigr) \cdot \sum_{j=1}^{\ell} c_j \cdot \mu_{t_j} . \]

Since all the coefficients of $\mu_s$, $s \in R$, are non-negative, the claim is proved. □

For $\mu \leq 0$, standard results on random walks (see [4, Theorem 8.3.4]) yield $\liminf_{m \to \infty} S_m < \infty$ almost
surely. Immediately, $\liminf_{n \to \infty} \sum_{i=0}^{n} r(w(i)) < \infty$ almost surely, thus $\mathbb{P}_t(\mathit{LimInf}(=+\infty)) = 0$.
The case when $\mu > 0$ is more subtle, and we need to introduce two more random variables:

- $M$, the least $m$ such that $S_m > 0$. ("M" for "maximum".)

- $M' := V_M$ (the actual number of steps to $M$).

Claim. (cf. [4, Theorem 8.4.4]) $\mathbb{E}_t(M) < \infty$.

Claim. $\mathbb{E}_t(V_{k+1} - V_k) = \mathbb{E}_t(V_2 - V_1) < \infty$ for all $k \geq 1$.

Proof. Since $\mathcal{M}$ is a Markov chain, we get the equality. By standard results on Markov chains (see [14,
Theorem 1.7.7]) we obtain $\mathbb{E}_t(V_2 - V_1) = \mathbb{E}_t(V_2) = (\pi(t))^{-1}$, where $\pi$ is an invariant (and positive) distribution
over the states of $\mathcal{M}$. Thus the inequality follows. □

Claim. $\mathbb{E}_t(M') < \infty$.

Proof.

\[ \mathbb{E}_t(M') = \sum_{m=1}^{\infty} \mathbb{P}_t(M = m) \cdot \mathbb{E}_t(V_m) \]
\[ = \sum_{m=1}^{\infty} \mathbb{P}_t(M = m) \cdot \mathbb{E}_t\bigl((V_m - V_{m-1}) + (V_{m-1} - V_{m-2}) + \cdots + (V_2 - V_1)\bigr) \]
\[ = \sum_{m=1}^{\infty} \mathbb{P}_t(M = m) \cdot (m - 1) \cdot \mathbb{E}_t(V_2 - V_1) \]
\[ = (\mathbb{E}_t(M) - 1) \cdot \mathbb{E}_t(V_2 - V_1) . \]

□

As a generalization of the variable $M$, we define, inductively and for almost all runs from $\mathit{Run}_{\mathcal{M}}(t)$, yet
another sequence $M_k$, $k \geq 0$, of random variables by setting $M_0 = 0$, and $M_{k+1}$ to be the least $m$ such that
$S_m > S_{M_k}$. (We get $M = M_1$.) In other words, $M_k$ are the times when maximal rewards were achieved on
revisit to $t$. We also define a sequence of events, $Z_k$, $k \geq 1$: A run $w \in \mathit{Run}_{\mathcal{M}}(t)$ is in $Z_k$ iff there is some $j$,
$V_{M_k} \leq j < V_{M_{k+1}}$, such that the reward accumulated on $w(0) \cdots w(j)$ is 0. ("Z" for "zero".)

Claim. $\sum_{k \geq 1} \mathbb{P}_t(Z_k) < \infty$.

Proof. It takes at least $S_{M_k} \geq k$ steps for the accumulated reward to return to 0 starting at time $V_{M_k}$. Since $V_{M_{k+1}} - V_{M_k}$ has the same
distribution as $M'$, we get $\mathbb{P}_t(Z_k) \leq \mathbb{P}_t(M' \geq k)$. Now

\[ \sum_{k \geq 1} \mathbb{P}_t(Z_k) \leq \sum_{k \geq 1} \mathbb{P}_t(M' \geq k) = \sum_{k \geq 1} \sum_{\ell \geq k} \mathbb{P}_t(M' = \ell) = \sum_{k \geq 1} k \cdot \mathbb{P}_t(M' = k) = \mathbb{E}_t(M') < \infty . \]

□

Thus by the Borel-Cantelli lemma, the probability that $Z_k$ occurs for infinitely many $k$ is 0. Consequently
$\liminf_{n \to \infty} \sum_{i=0}^{n} r(w(i)) > 0$ for almost all $w$. Similarly we can prove for all $h \geq 0$ that $\liminf_{n \to \infty} \sum_{i=0}^{n} r(w(i)) > h$
for almost all $w$. Hence, $\liminf_{n \to \infty} \sum_{i=0}^{n} r(w(i)) = \infty$ almost surely, because a countable intersection of
sets of probability 1 has probability 1. Thus $\mathbb{P}_t(\mathit{LimInf}(=+\infty)) = 1$. □

Lemma 11. The finite-memory strategy $\sigma$ from the proof of Proposition 1 is optimal for $\mathit{LimInf}(=+\infty)$.

Proof. Observe that fixing $\sigma$ yields a finite-state Markov chain, $\mathcal{G}(\sigma)$, on the parallel composition of $\mathcal{G}$ and
the finite automaton used for updating the memory of $\sigma$. Let us fix an arbitrary bottom strongly connected
component (BSCC), $C$, of $\mathcal{G}(\sigma)$, and denote by $R$ the states of $C$ in which the memory of $\sigma$ is being reset. We
are now going to analyze, using Lemma 10, the irreducible MC, $\mathcal{M}$, induced by restricting $\mathcal{G}(\sigma)$ to $C$. Fix an
arbitrary $s \in R$. Recall that the variable $G^s_k$, defined before stating Lemma 10, returns the reward accumulated
between the $k$-th visit to $s$ and the next visit to $R$. It is easy to verify that the common mean, $\mu_s$, of $G^s_k$ is
equal to the mean of the stopping time $T_s$ introduced in the main text of the proof, and thus positive. Therefore
Lemma 10 guarantees that for every state $s \in R$ lying in some BSCC we have $\mathbb{P}_s(\mathit{LimInf}(=+\infty)) = 1$.
Since $\mathcal{G}(\sigma)$ is finite, almost every run in it reaches some BSCC and every state in it. Because $\mathit{LimInf}(=+\infty)$
is a tail objective we get $\mathbb{P}_s(\mathit{LimInf}(=+\infty)) = 1$ for every state $s$. □

Lemma 12. In a maximizing MDP, $\mathcal{G}$, with value 1 in all states, given a pure finite-memory strategy $\sigma$
optimal for $\mathit{LimInf}(=+\infty)$, a pure and memoryless optimal strategy $\tau$ can be constructed.

Proof. As in the proof of Lemma 11, given a finite-memory strategy, $\varrho$, we denote by $\mathcal{G}(\varrho)$ the finite-state
Markov chain, states of which are pairs $(s, q)$ where $s$ is a state of $\mathcal{G}$, and $q$ is a state of the finite automaton
representing the memory of $\varrho$. Probabilities are obtained in the natural way from $\varrho$ and $\mathcal{G}$. Consider now the
Markov chain $\mathcal{G}(\sigma)$. The initial state is $(s_0, q_0)$ where $s_0$, $q_0$ are initial states of $\mathcal{G}$, and the automaton for $\sigma$,
respectively. For technical reasons we assume that for each $q$ there is at most one $s$ so that $(s, q)$ is reachable
from $(s_0, q_0)$.

If there are two states, $q \neq p$, of the automaton for $\sigma$, and a state $s$ of $\mathcal{G}$ such that both $(s, q)$ and $(s, p)$
are reachable from $(s_0, q_0)$, we call both $q$ and $p$ ambiguous. If there is no ambiguous state, $\sigma$ is already
memoryless. If there are ambiguous states, we show how to modify $\sigma$ to get another pure and finite-memory
optimal strategy $\sigma'$, such that the associated Markov chain, $\mathcal{G}(\sigma')$, has fewer ambiguous states. As there are
only finitely many ambiguous states in the beginning, repeating this process inevitably leads to the optimal
pure and memoryless strategy $\tau$.

We thus assume that there is a state $s$ of $\mathcal{G}$ such that $A := \{(s, q) \mid (s, q) \text{ is reachable from } (s_0, q_0)\}$ has at
least two elements. For every fixed choice of $(s, q) \in A$ we now define a new finite-memory strategy $\sigma_q$. This
is derived by modifying the finite automaton for $\sigma$ so that all transitions leading to some $p$, where $(s, p) \in A$,
are redirected to $q$. From this, due to our technical assumption, it already follows that $\sigma_q$ has fewer ambiguous
states. It remains to prove that there is some $q$ such that $\sigma_q$ is optimal.

There are two cases to consider. First, consider the situation where there is $(s, q) \in A$ such that with
some positive probability states from $A \setminus \{(s, q)\}$ are visited only finitely often in $\mathcal{G}(\sigma)$. This implies that
there is a BSCC, $S$, of $\mathcal{G}(\sigma)$, such that $|S \cap A| \leq 1$. We choose $(s, q)$ so that it minimizes the distance (in
the transition graph of $\mathcal{G}(\sigma)$) to $S$ among the states from $A$. This implies that, starting in $(s, q)$, states from
$A \setminus \{(s, q)\}$ are avoided with some positive probability, $\delta$. We now prove that $\sigma_q$ is optimal. Indeed, let $\neg A$
be the event of not visiting $A$, and let $E$ be an arbitrary event. Then $\mathbb{P}^{\sigma}(E \mid \neg A) = \mathbb{P}^{\sigma_q}(E \mid \neg A)$. On
the other hand, every run in $\mathcal{G}(\sigma)$ visiting $A$ projects to $\mathcal{G}(\sigma_q)$ as a run $w$ visiting $(s, q)$. Here we have two
possibilities. Either $\delta = 1$, and we set $w_q$ to be the suffix of $w$ starting with the first occurrence of $(s, q)$. Or
$\delta < 1$, implying that $S \cap A = \emptyset$ and thus $(s, q)$ is not in a BSCC. Thus on almost all runs $(s, q)$ is visited
only finitely many times, and we may define $w_q$ to be the suffix starting with the last occurrence of $(s, q)$
in $w$. For every event $E$ we define the set $E' := \{w \mid w_q \in E\}$. Denoting simply by $A$ the event of
visiting $A$, it is easy to verify for all $E$ that $\mathbb{P}^{\sigma}(E \mid A) = \mathbb{P}^{\sigma_q}(E' \mid A)$. Since $\mathit{LimInf}(=+\infty)$ is tail, we
have $\mathit{LimInf}(=+\infty)' \subseteq \mathit{LimInf}(=+\infty)$. Thus almost all runs in $\mathcal{G}(\sigma_q)$ satisfy $\mathit{LimInf}(=+\infty)$.

If the first case does not apply then there must be a BSCC, $S$, such that $|S \cap A| \geq 2$. Using Lemma 10 for
$\mathcal{G}(\sigma)$ restricted to $S$, with $R = A$, we obtain that there must be at least one $(s, q) \in A$ such that the expected
accumulated reward until revisiting $A$, $\mu_q$, is positive. Observe that $(S \setminus A) \cup \{(s, q)\}$ forms a BSCC, $S_q$, in
$\mathcal{G}(\sigma_q)$. Using Lemma 10 on $\mathcal{G}(\sigma_q)$ restricted to $S_q$, with $R = \{(s, q)\}$, we obtain that almost all runs in $\mathcal{G}(\sigma_q)$ started
in $S_q$ satisfy $\mathit{LimInf}(=+\infty)$. Because, similarly to the previous case, almost all runs in $\mathcal{G}(\sigma_q)$ remain either
unaffected or visit $S_q$, we obtain again that almost all runs in $\mathcal{G}(\sigma_q)$ satisfy $\mathit{LimInf}(=+\infty)$. □

A.3 Proof of Proposition 1, objective $\mathit{LimInf}(>-\infty)$

Recall that $W_{\infty} = \{s \mid \mathit{Val}^{\mathit{LimInf}(=+\infty)}(s) = 1\}$, and $W_{+} = \{s \mid \mathit{Val}^{\mathit{All}(\geq 0)}(s) = 1\}$.

Lemma 13. For every maximizing MDP, $\mathcal{G}$, and its state, $s$, if $\mathit{Val}^{\mathit{LimInf}(>-\infty)}(s) = 1$ then there is a strategy,
$\tau$, such that $\mathbb{P}^{\tau}_s(\mathit{Reach}(W_{\infty} \cup W_{+})) = 1$.

Proof. If $\mathit{Val}^{\mathit{Reach}(W_{\infty})}(s) = 1$ then we are already done for this state. Assume $\mathit{Val}^{\mathit{Reach}(W_{\infty})}(s) < 1$.

Claim. There is a strategy, $\tau$, such that

1. $\mathbb{P}^{\tau}_s(\mathit{LimInf}(>-\infty)) = 1$;

2. $\tau$ restricted to $W_{\infty}$ is memoryless and $\mathbb{P}^{\tau}_v(\mathit{LimInf}(=+\infty)) = 1$ for all $v \in W_{\infty}$;

3. almost all $w \in \mathit{Run}_{\mathcal{G}(\tau)}(s)$ which do not visit $W_{\infty}$ satisfy

\[ \infty > \liminf_{n \to \infty} \sum_{i=0}^{n} r(w(i)) > -\infty . \qquad (3) \]

Proof. Choose a strategy satisfying 1; it must exist by [11, Theorem 3.1]. By Proposition 1 for $\mathit{LimInf}(=+\infty)$
we obtain 2, because $\mathit{LimInf}(=+\infty) \subseteq \mathit{LimInf}(>-\infty)$. On the other hand, runs avoiding $W_{\infty}$ belong to
$\mathit{LimInf}(=+\infty)$ with probability 0, as a consequence of Lemma 3 for the objective $\mathit{LimInf}(=+\infty)$ (see footnote 6),
proving 3. □

We now define an event $\mathit{Inf}(v)$ for all states $v$. Consider a run $w$ satisfying (3). There must be some
integer $\ell$ such that $\sum_{i=0}^{n} r(w(i)) \geq \ell$ for all $n \geq 0$. Choosing the greatest such $\ell$, there is some index, $j$, such
that $\sum_{i=0}^{j} r(w(i)) = \ell$. We call the smallest such $j$ the minimum of $w$, and $\ell$ is said to be the minimal
value of $w$. According to this, we define inductively the following functions: $M_1(w)$ is the minimum of $w$,
and, given $n := M_k(w)$, and the suffix $w' = w(n+1)\, w(n+2) \cdots$ of $w$, we set $M_{k+1}(w) := M_1(w') + n + 1$
for $k \geq 1$. Further, $m_k := \sum_{i=0}^{M_k(w)} r(w(i))$. (See also Figure 1 for an example.) The sequence $\{m_k\}_{k \geq 1}$ is non-decreasing
and, due to the first inequality in (3), also bounded, hence it has a well defined finite limit, $m$.
Given some state $v$, we define an event, $\mathit{Inf}(v)$, by the condition that there are infinitely many $k$ such that the
state visited at time $M_k$ is $v$ and $m_k = m$.

Claim. For all states $v$, if $\mathbb{P}^{\tau}_s(\mathit{Inf}(v)) > 0$ then $v \in W_{+}$.

Proof. Fix a state $v$ satisfying the assumption of the claim. Note that due to our choice of $\tau$, for all such $v$,
$W_{\infty}$ is not reached on a run from $\mathit{Inf}(v)$. Observe that $\mathit{Inf}(v)$ is tail, so by [11, Theorem 3.1] there is a state $v'$
and a strategy $\pi$ such that $\mathbb{P}^{\pi}_{v'}(\mathit{Inf}(v)) = 1$. In particular, this must be true for $v' = v$ since $\mathit{Inf}(v) \subseteq \mathit{Reach}(v)$,
and the objective is tail. Finally, the strategy can be chosen so that almost surely $m_k = 0$ for all $k \geq 1$. In
other words, $\mathbb{P}^{\pi}_v(\mathit{All}(\geq 0)) = 1$. In particular, $v \in W_{+}$. □

Since there are only finitely many states, the union $\bigcup_{v} \mathit{Inf}(v)$ has probability 1 on the condition of not
reaching $W_{\infty}$. The last claim showed that $\mathbb{P}^{\tau}_s(\mathit{Reach}(W_{+}) \mid \mathit{Inf}(v)) = 1$ for all $v \in V$ with $\mathbb{P}^{\tau}_s(\mathit{Inf}(v)) > 0$. This
proves $\mathbb{P}^{\tau}_s(\mathit{Reach}(W_{\infty} \cup W_{+})) = 1$. □



6 A careful reader may suspect a circular dependency since Lemma 3 uses Proposition 1. This is, however, a correct
use, since it only uses the proposition for $\mathit{LimInf}(=+\infty)$, which has already been proved.

Fig. 1. An example of a run and its minima.

A.4 Proof of Lemma 1

Lemma 1. Let $\mathcal{M}$ be a finite, strongly connected (irreducible) Markov chain, and $O$ be a tail objective. Then
there is $x \in \{0, 1\}$ such that $\mathbb{P}_s(O) = x$ for all states $s$.

Proof. From every state, $s$, every other state, $t$, is visited almost surely. $O$ is tail, thus $\mathbb{P}_s(O) = \mathbb{P}_t(O)$. Assume
that $\mathbb{P}_s(O) > 0$ for some $s$, and thus for all $s$. Since a Markov chain is a special case of an SSG, we directly
apply [11, Theorem 3.2] and get that $\mathbb{P}_s(O) = 1$. □



A.5 Proof of Lemma 2

Lemma 14. Let $\mathcal{M}$ be an irreducible Markov chain with rewards on states, and $s$ a fixed state of $\mathcal{M}$. If
$\mathbb{P}_s(\mathit{LimInf}(=+\infty)) = 1$ then $\mathbb{P}_s(\mathit{Mean}(>0)) = 1$.

Proof. We fix $s$ as a starting state. Denote by $X_k$, $k \geq 1$, the reward accumulated between the $k$-th (inclusive)
and $(k+1)$-st (exclusive) visit to $s$. Since $\mathcal{M}$ is a Markov chain, these variables are i.i.d.; we denote by $\mu$ their
common mean. Choosing $R = \{s\}$ in Lemma 10 yields that $\mu > 0$. Thus the sums, $S_i := \sum_{k=1}^{i} X_k$, define a
homogeneous random walk with a positive drift. Define:

- $V_k$, $k \geq 1$, to be the time of the $k$-th visit to $s$ (note: $V_1 = 0$),

- $M$ to be the least $k$ such that $S_k > 0$, and

- $M' := V_M$.

By the claim $\mathbb{E}_t(M') < \infty$ from the proof of Lemma 10 we know that $\mathbb{E}_s(M') < \infty$. Further we define:

- $M_0 = 0$, and $M_k$, $k > 0$, to be the least $m$ such that $S_m > S_{M_{k-1}}$, and

- $Y_k := V_{M_k + 1} - V_{M_{k-1} + 1}$, $k > 0$. (Note: $\sum_{k=1}^{n} Y_k = V_{M_n + 1}$.)

The variables in the sequence $Y_k$ are independent and distributed identically with $M'$, thus we may apply the
strong law of large numbers (see, e.g., Theorem 1.10.1 in [14]) and obtain that almost surely

\[ \lim_{n \to \infty} \frac{V_{M_n + 1}}{n} = \lim_{n \to \infty} \frac{\sum_{k=1}^{n} Y_k}{n} = \mathbb{E}_s(M') < \infty . \]

Because $S_{M_n} \geq n$, we have almost surely

\[ \lim_{n \to \infty} \frac{S_{M_n}}{V_{M_n + 1}} \geq \lim_{n \to \infty} \frac{n}{V_{M_n + 1}} = \frac{1}{\mathbb{E}_s(M')} > 0 . \]

Because the leftmost term is equal to the mean payoff, we conclude that $\mathbb{P}_s(\mathit{Mean}(>0)) = 1$. □

A.6 Proof of Proposition 5

Recall the SSG $\mathcal{G}'$ with the reachability objective $R$ from the proof of Theorem 1. This game emulates
playing $\mathcal{G}$ until (1) the accumulated reward exceeds $|V| - j$ or a state $u$ with $\mathit{Val}^{\mathit{LimInf}(=-\infty)}(u) = 1$ is visited
(then, by Lemma 4, the players may switch to optimizing the probability of $\mathit{LimInf}(=-\infty)$ instead), or until
(2) the accumulated reward is $-j$. Memoryless strategies for $\mathcal{G}'$ induce strategies for $\mathcal{G}$ which use memory
of size $|V|$ to store the accumulated reward until it exceeds $|V| - j$ or hits $-j$. From this and from the analysis
in the proof of Theorem 1 we can see that the strategies $\sigma$ and $\pi$ from the statement of the proposition are
easy to construct, with the promised time complexity, to be pure and using only a finite memory of size $|V|$.

The last thing to show is how to transform $\sigma$ to some memoryless $\sigma'$, preserving the optimality for
$\mathit{Term}(j)$ in $s$. Restricted to states $u$ with $\mathit{Val}^{\mathit{LimInf}(=-\infty)}(u) = 1$, $\sigma$ is already memoryless. Call these $u$ safe.
We set $\sigma'(u) = \sigma(u)$ for every safe $u$. We further call unsafe those states $u$ which are not safe, but for which there is
some strategy $\tau$ such that $\mathbb{P}^{\sigma,\tau}_s(\mathit{Reach}(u)) > 0$. Unsafe states may have been visited with various accumulated
rewards so far, but from what we already proved it follows that all these accumulated rewards lie between
$-j$ and $|V| - j$ (excl.). For an unsafe $u$, denote by $i_u$ the maximal such accumulated reward, and by $w_u$ some
history along which this was accumulated. It remains to define $\sigma'$ for unsafe $u$. We simply set $\sigma'(u) = \sigma(w_u)$.
Since, under $\sigma'$, no unsafe state is reached from a safe state, $\sigma'$ is still optimal for $\mathit{Term}(j)$ in all safe states.
Consider an unsafe $u$, and some $i$, $-j < i \leq i_u$, and an arbitrary strategy $\pi'$ for Min. Then in $\mathcal{G}$, under the
strategies $(\sigma', \pi')$, on condition that $u$ was visited with an accumulated reward $i$, almost all runs from $u$ either
visit a safe state, or the accumulated reward reaches $-j$ at some point, or an unsafe state $t$ is visited, and at
the same time the accumulated reward is at most $i_t + i - i_u$. Thus by double induction, first on $|V| - j - i_u$
then on $i$, for all unsafe $u$ and $i \leq i_u$ we have that $\sigma'$ is optimal for $\mathit{Term}(i)$ in $u$. Thus $\sigma'$ is pure, memoryless
and optimal for $\mathit{Term}(j)$ in $s$. □

An example where memory for Min is needed. This example shows that the strategy $\pi$ of player Min from
Proposition 5 may indeed have to use memory. Consider this minimizing MDP:

- States: v, low, up, back, down; Min owns $V_{\bot} = \{v\}$.

- Transitions (and their rewards): v → back (reward 0), v → low (-1), low → up (+1), up → up (+1),
back → down (0), back → v (+1), down → down (-1).

- From probabilistic states the successor is chosen uniformly among available transitions.

Then for all $j > 0$: $\mathit{Val}^{\mathit{Term}(j)}(v) < 1$, as witnessed by the strategy choosing back as a successor of v whenever
the reward accumulated so far is 1, and low in all other cases. However, there are only two pure and
memoryless strategies for Min:

- choosing the transition v → back makes the probability of $\mathit{Term}(j)$ to be 1 for all $j > 0$;

- choosing v → low makes the probability of $\mathit{Term}(1)$ to be 1.